Intro to research toolbox

Jeremy Lew

July 15, 2024

Topics

  1. Introduction & learning plan
  2. Getting comfortable with R
  3. Programming fundamentals
  4. Data cleaning
  5. Data visualisation
  6. Statistical analysis
  7. Presentation of results

Introduction & learning plan

Motivation

  1. Added tools to your toolbox
  2. Transferrable skill
  3. Learning how to program in 1 language helps you learn other languages
  4. Leverage on open source tools/projects written by others

Learning plan

  1. The topics in this deck mirror the workflow of a project

flowchart LR
  D[Data exported\nfrom REDCAP]-->C[Data cleaning]
  C-->S[Statistical analysis]
  S-->P[Presentation\nof results]

  1. The game plan is to first do a quick pass through from data cleaning to analysis:
    1. Take a toy dataset
    2. “Clean” the data (minimally)
    3. Run a regression model
  2. Once we have gotten comfortable with R, we will explore each topic in greater detail
    1. Data cleaning
    2. Programming fundamentals
    3. Analysis: generalised linear models
    4. Presentation of results

Resources

  1. R courses
  2. The Epidemiologist R Handbook
  3. R for Data Science
  4. Advanced R
  5. Cheat Sheets

Getting comfortable with R

What is an R package?

  1. It is a collection of code/data written by someone and packaged for distribution
  2. Hosted on the Comprehensive R Archive Network and/or Github for public download
  3. From the CRAN webpage, we are able to find a reference manual documenting the data/functions of the package
  4. From the Github repository, we are able to see the actual codes implemented by the package’s author(s)
  5. To install a package we would run install.packages("palmerpenguins") in the console
  6. To load a package, we would run library(palmerpenguins)
  7. To uninstall a package we would run remove.packages("palmerpenguins") in the console

[DEMO] Intro to RStudio workspace

  1. Console: where you run/execute lines of code

  2. To assign a variable, we use the <- operator

    x <- 1:10; print(x)
     [1]  1  2  3  4  5  6  7  8  9 10
    1. random-access memory is allocated when a variable is assigned/declared
    2. random-access memory is freed-up when the variable is deleted (or garbage collected if it goes out of scope)
  3. Environment: where you’re able to see variables stored in random-access memory

    1. To clear variables from environment \(\rightarrow\) execute rm(list = ls()) in console

[DEMO] Intro to RStudio workspace

  1. Script
    1. Writing code and saving the script
    2. Execute single/multiple lines of code \(\rightarrow\) highlight code + Ctrl-enter
    3. Runs from top to bottom
    4. Comments will not be executed
  2. Hotkeys
    1. Clear the console \(\rightarrow\) Ctrl-l
    2. Clear a line in console \(\rightarrow\) Ctrl-d

[PRACTISE] Tasks to do

  1. For this task, you will need the palmerpenguins and tidyverse packages. Install them if you have not already done so
  2. We will make use of the penguins dataset from palmerpenguins package
  3. Understand the data
  4. Dichotomise body_mass_g (continuous) into a categorical variable with 2 categories (“light”: <= 4750g and “heavy”: > 4750g)
  5. Run a linear regression1 with body_mass_g as the dependent variable and independent variables: species, bill_length_mm, bill_depth_mm, flipper_length_mm and sex
  6. Run a logistic regression with the categorical body mass variable you created earlier as the dependent variable, with the same independent variables as before

Programming Fundamentals

Data types

  1. Character data type e.g. "a"

    typeof("a")
    [1] "character"
  2. Integer data type e.g. 1L

    typeof(1L)
    [1] "integer"
  3. Numeric data type e.g. 5.0

    typeof(5.0)
    [1] "double"
  4. Logical data type e.g. TRUE or FALSE

    typeof(TRUE)
    [1] "logical"
  5. Complex data type e.g. 1 + 4i

    typeof(1 + 4i)
    [1] "complex"

Data types (continued)

  1. Factor datatype for working with categorical variables

    x <- factor(sample(c("Malay", "Others", "Indian", "Chinese"), 10, replace = T),
                levels = c("Chinese", "Malay", "Indian", "Others"))
    print(x)
     [1] Indian  Malay   Malay   Chinese Chinese Chinese Malay   Malay   Others 
    [10] Others 
    Levels: Chinese Malay Indian Others
    print(levels(x))
    [1] "Chinese" "Malay"   "Indian"  "Others" 
  2. A factor variable is more useful than a plain character vector. E.g. the first level will be used by glm as the reference category

Data structures (vectors)

  1. Data structures are containers for holding data

  2. A character vector holds multiple characters

    letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
    [20] "t" "u" "v" "w" "x" "y" "z"
    typeof(letters)
    [1] "character"
  3. A numeric vector holds multiple numbers

    x <- 1:5; print(x)
    [1] 1 2 3 4 5
    typeof(x)
    [1] "integer"

Exercise

  1. Can you query the length of your character vector? length(letters)
  2. Can you slice the character vector to get the first 5 elements? letters[1:5]
  3. Can you slice the character vector to get the 1st, 9th, 20th elements? letters[c(1, 9, 20)]
  4. Can you flip the character vector from last to first element? letters[length(letters):1]

Data structures (lists)

  1. Vectors can only hold 1 kind of data type e.g. characters, numeric etc
  2. Lists are “containers” that can hold multiple data types
    1. Unnamed lists

      list("a", 1, 10:15)
      [[1]]
      [1] "a"
      
      [[2]]
      [1] 1
      
      [[3]]
      [1] 10 11 12 13 14 15
    2. Named lists

      list(a = "a", b = 1, c = 10:15)
      $a
      [1] "a"
      
      $b
      [1] 1
      
      $c
      [1] 10 11 12 13 14 15

Data structures (lists, continued)

Exercise

  1. Create a variable in your console: mylist <- list(a = "a", b = 1, c = 10:15)
  2. Can you query the length of mylist? length(mylist)
  3. Can you get the second element of your mylist using
    1. index: mylist[1]
    2. key: mylist$a or mylist[["a"]]
  4. Compare the difference between mylist[3] and mylist[[3]]
  5. Compare the difference between class(mylist[3]) and class(mylist[[3]])

Data structures (dataframe)

  1. Dataframe is a tabular container that can hold multiple data types like lists, but each column can only store data of the same data type

  2. To view the first 2 rows of the penguins dataset from the palmerpenguins package

    library(tidyverse)
    library(palmerpenguins)
    penguins %>% head(2)
    # A tibble: 2 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    # ℹ 2 more variables: sex <fct>, year <int>
  3. To view the last 2 rows of the dataset

    penguins %>% tail(2)
    # A tibble: 2 × 8
      species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
    1 Chinstrap Dream            50.8          19                 210        4100
    2 Chinstrap Dream            50.2          18.7               198        3775
    # ℹ 2 more variables: sex <fct>, year <int>
  4. To view the data in rstudio, execute view(penguins)

Data structures (dataframe, continued)

Exercise

  1. Can you query the dimensions of the penguins dataset using dim(penguins), ncol(penguins), nrow(penguins)?
  2. Can you get a glimpse of the data using the functions glimpse(penguins)?
  3. Can you view summary statistics of the data using summary(penguins)?

Data structures (dataframe, accessing variables)

  1. To get the column/variable-names of your dataset

    colnames(penguins)
    [1] "species"           "island"            "bill_length_mm"   
    [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
    [7] "sex"               "year"             
  2. To access any column variable, we use the $ syntax

    penguins$species[1:10]
     [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
    Levels: Adelie Chinstrap Gentoo
  3. To get a table count of a variable or a cross-table count of 2 variables

    table(penguins$sex, useNA = "ifany")
    
    female   male   <NA> 
       165    168     11 
  4. See Data Cleaning for more ways to work with dataframes

Data structures (dataframe, loading data)

  1. To import data from csv file, use read_csv
  2. To import data from an excel file, use read_excel
  3. To import data from SAS/SPSS/Stata or export from R to these file formats, use the haven package

Data structures (dataframe, adding labels)

  1. To add labels to a dataframe, we make use of the labelled package

Functions

  1. A good chapter on explaining functions is Chapter 6, Advanced R, where most of the examples in the slides are taken

  2. Example function

    f01 <- function(x, y) {
      # A comment
      x + y
    }
  3. Components of a function

    1. Arguments \(\rightarrow\) inputs to the function (there can also be functions with no arguments)
    2. Body \(\rightarrow\) the code inside the function
    3. Environment \(\rightarrow\) a list-like structure that stores/maps the variable name to value (a namespace as some people call it)
  4. Functions are objects that can be assigned or they can be anonymous functions

Function invocation

  1. How do we invoke the function?

    print(f01)
    function(x, y) {
      # A comment
      x + y
    }
    f01(3, 5)
    [1] 8
  2. How do we invoke one function after another?

    1. Suppose we have another function f02

      f02 <- function(a, b) {
        a * 10 + b
      }
    2. To call f01 first followed by f02 and sqrt we will “nest” the functions (inside-out, right to left)

      sqrt(f02(f01(1, 2), 5))
      [1] 5.91608

Function invocation (continued)

Another way to invoke a series of functions is to use the pipe operator %>% provided by the magrittr package

library(magrittr)
value <- 1
value %>%
  f01(2) %>%
  f02(5) %>%
  sqrt()
[1] 5.91608

Exercise

  1. What is the difference between f02 and f02(1, 2)?

Function output

  1. What is returned by a function? The last evaluated expression

  2. How do we store values outputted from a function? With assignment statment

  3. Early termination of a function with a return statement

Functions - argument passing

  1. Pass by copy vs pass by reference

  2. In R, all functions are pass by reference if you don’t modify the object subsequently within the function. i.e. copy-on-modify

  3. How to pass arguments?

    f04 <- function(x, y, z) {
      (x / y) * z
    }
    1. By position e.g. f04(10, 50, 20)

    2. By name e.g. f04(z = 20, y = 50, x = 10)

    3. Unpacking multiple named arguments using do.call

      do.call(f04, list(x = 10, y = 50, z = 20))
      [1] 4

Functions - lexical scoping, name masking

Names defined inside a function mask names defined outside a function

x <- 10
y <- 20
f05 <- function() {
  x <- 1
  y <- 2
  c(x, y)
}

f05()
[1] 1 2

Functions - lexical scoping, looking one level up

R will look up a variable in the enclosing scope (e.g. within the function) and if it’s not found, will continue to proceed upwards (e.g. enclosing function or global environment) to look for the variable until it is found

x <- 1
f06 <- function() {
  y <- 2
  i <- function() {
    z <- 3
    c(x, y, z)
  }
  i()
}

f06()
[1] 1 2 3

Exercise

Using the sample function, change the z variable inside the function and observe that…

z <- 10
f07 <- function(x, y) {
  # make your edit here
  z <- z * 100
  z + x + y
}

…there is no change to the z variable outside the function in the global environment

Functions - lexical scoping, dynamic lookup

R looks for the values when the function is run, not when the function is created

z <- 10
f08 <- function(x, y) {
  z + x + y
}

z <- 100
f08(5, 10)
[1] 115

Functions

Loops

Debugging

Memory Management

Data Cleaning

Common functions - packages

  1. These functions are from various packages, conveniently collected in the tidyverse package
  2. The best package for data wrangling is dplyr. Useful info can be found in this chapter
  3. A package for working with dates is lubridate
  4. A package for working with strings is stringr. This is useful for cleaning free-text response data
  5. A package for working with factor datatype is forcats

Common functions - dplyr::select

The select function enables you to select a subset of the columns in your dataset


# library(tidyverse) # if not already loaded
# library(penguins) # if not already loaded
penguins %>%
  select(species, island, bill_length_mm) %>%
  head(3)
# A tibble: 3 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3


# there is also ends_with
penguins %>%
  select(starts_with("bill")) %>%
  head(3)
# A tibble: 3 × 2
  bill_length_mm bill_depth_mm
           <dbl>         <dbl>
1           39.1          18.7
2           39.5          17.4
3           40.3          18  


penguins %>%
  select(matches("mm")) %>%
  head(3)
# A tibble: 3 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <int>
1           39.1          18.7               181
2           39.5          17.4               186
3           40.3          18                 195

Common functions - dplyr::filter

The filter function enables you to select a subset of the rows that meet a certain criteria

penguins %>%
  filter(body_mass_g > 3500) %>%
  head(3)
# A tibble: 3 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
penguins %>% filter(sex == "female")
# A tibble: 165 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Common functions - dplyr::mutate

The mutate function enables you to add columns to your dataset. The added columns can be derived from existing column(s)

penguins %>%
  mutate(body_mass_100g = body_mass_g / 100) %>%
  select(body_mass_g, body_mass_100g) %>%
  head(5)
# A tibble: 5 × 2
  body_mass_g body_mass_100g
        <int>          <dbl>
1        3750           37.5
2        3800           38  
3        3250           32.5
4          NA           NA  
5        3450           34.5
penguins %>%
  mutate(bill_length_plus_depth_mm = bill_length_mm + bill_depth_mm) %>%
  select(matches("bill")) %>%
  head(5)
# A tibble: 5 × 3
  bill_length_mm bill_depth_mm bill_length_plus_depth_mm
           <dbl>         <dbl>                     <dbl>
1           39.1          18.7                      57.8
2           39.5          17.4                      56.9
3           40.3          18                        58.3
4           NA            NA                        NA  
5           36.7          19.3                      56  

Common functions - dplyr::group_by, summarise

  1. The summarise function enables you to get summary statistics like N, mean, median etc

    penguins %>% summarise(N = n(),
                           mean_body_mass_g = mean(body_mass_g))
    # A tibble: 1 × 2
          N mean_body_mass_g
      <int>            <dbl>
    1   344               NA
  2. The group_by function enables you to get summary statistics within groups

    penguins %>%
      group_by(sex) %>%
      summarise(N = n(),
                mean_body_mass_g = mean(body_mass_g),
                median_body_mass_g = median(body_mass_g))
    # A tibble: 3 × 4
      sex        N mean_body_mass_g median_body_mass_g
      <fct>  <int>            <dbl>              <dbl>
    1 female   165            3862.               3650
    2 male     168            4546.               4300
    3 <NA>      11              NA                  NA

Common functions - dplyr::arrange

The function arrange enables us to sort by a certain variable

penguins %>%
  group_by(sex, year) %>%
  summarise(mean_body_mass_g = mean(body_mass_g)) %>%
  arrange(desc(year))
# A tibble: 9 × 3
# Groups:   sex [3]
  sex     year mean_body_mass_g
  <fct>  <int>            <dbl>
1 female  2009            3874.
2 male    2009            4521.
3 <NA>    2009              NA 
4 female  2008            3888.
5 male    2008            4632.
6 <NA>    2008            4650 
7 female  2007            3821.
8 male    2007            4479.
9 <NA>    2007              NA 

Common functions - dplyr::if_else, case_when

The functions if_else and case_when are often used with mutate to create new variables

set.seed(2)
penguins %>%
  mutate(bill_length_type = if_else(bill_length_mm >= 48.5, "long bill", "short bill")) %>%
  sample_n(4)
# A tibble: 4 × 9
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            43.5          18.1               202        3400
2 Gentoo    Biscoe           43.6          13.9               217        4900
3 Gentoo    Biscoe           48.1          15.1               209        5500
4 Gentoo    Biscoe           46.8          14.3               215        4850
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_type <chr>
penguins %>%
  mutate(bill_length_type = case_when(bill_length_mm >= 48.5 ~ "long bill",
                                      bill_length_mm < 48.5 ~ "short bill",
                                      TRUE ~ NA_character_)) %>%
  select(bill_length_type) %>%
  table(useNA = "ifany")
bill_length_type
 long bill short bill       <NA> 
        87        255          2 

Common functions - forcats::fct_relevel

Common functions - forcats::fct_collapse

Common functions - stringr::str_detect

Common functions - dplyr::pivot_wider & longer

Common functions - dplyr::bind_rows, bind_cols

Common functions - dplyr::mutate + across

Data visualisation

The ggplot2 package is well known for plotting figures

Histograms

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = species)) + geom_histogram(bins = 40)

Facet wrap

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = sex)) +
    geom_histogram() +
    facet_wrap(~species, dir = "v", scale = "free_y")

Analysis: Descriptive statistics

Analysis: Bivariate tests for independent observations

Variable 1 Variable 2 Bivariate test
Categorical Categorical 1. Chi-square test
2. Fisher’s exact test
Categorical Continuous Parametric:
1. Independent t-test (2 categories)
2. One-way independent ANOVA (>2 categories)
Categorical Continuous Non-parametric:
1. Mann-Whitney test (2 categories) a.k.a. Wilcoxon rank-sum test
2. Kruskal-Wallis test (>2 categories)
Continuous Continuous Parametric: Pearson’s correlation coefficient
Continuous Continuous Non-parametric: Spearman’s correlation coefficient

Analysis: Bivariate tests for repeated measurements

Variable 1 Variable 2 Bivariate test
Categorical Categorical time
(e.g. baseline, 12-month)
McNemar’s test
Continuous Categorical time
(e.g. baseline, 12-month)
Parametric: Dependent t-test
Continuous Categorical time
(e.g. baseline, 12-month)
Non-parametric: Wilcoxon signed-rank test

Measures of central tendency

  1. Mean (SD), median (IQR) helps you get a sense of the distribution of characteristics in your study population with respect to levels of the outcome
  2. The tableby function of the arsenal package lets you do this easily
  3. We will talk about improving upon the formatting of the table in presenting tables
library(arsenal)

1tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
knitr::kable(align = c("l", rep("r", 5)))
1
tableby function invoked by these 3 lines
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.791 (2.663) 48.834 (3.339) 47.505 (3.082) 43.922 (5.460)
- Median (Q1, Q3) 38.800 (36.750, 40.750) 49.550 (46.350, 51.075) 47.300 (45.300, 49.550) 44.450 (39.225, 48.500)
- Range 32.100 - 46.000 40.900 - 58.000 40.900 - 59.600 32.100 - 59.600

Measures of central tendency (repeated measures)

  1. Suppose we have repeated measures of HbA1c and LDL from patients at e.g. baseline and 12-month

    generate_fake_patient_data <- function(id) {
      data.frame(list(
        patient_id = id,
        time = c("baseline", "12-month"),
        hba1c = runif(n = 2, min = 4, max = 14),
        ldl = sample(c("good control", "bad control"))
      ))
    }
    
    set.seed(2024)
    data <- map(
      sprintf("P%s", str_pad(1:30, width = 2, pad = 0)),
      generate_fake_patient_data
    ) %>%
      bind_rows()
    print(head(data, 10))
       patient_id     time     hba1c          ldl
    1         P01 baseline 12.369425 good control
    2         P01 12-month  7.208675  bad control
    3         P02 baseline  8.570092 good control
    4         P02 12-month 11.014203  bad control
    5         P03 baseline 12.765881  bad control
    6         P03 12-month  5.190482 good control
    7         P04 baseline 11.035531  bad control
    8         P04 12-month  9.157855 good control
    9         P05 baseline  5.103111  bad control
    10        P05 12-month 12.632152 good control
  2. The paired function from arsenal package helps us tabulate pre-post tests easily

    # pacman::p_load(arsenal, kableExtra)
    
    table <- paired(
      time ~ signed.rank(hba1c) + ldl,
      data = data,
      id = patient_id,
      control = paired.control(digits = 2,
                               numeric.stats = c("Nmiss2", "meansd", "medianq1q3", "range"),
                               numeric.test = "signed.rank",
                               signed.rank.correct = FALSE))
    
    table %>%
      summary(text = TRUE) %>%
      kable(align = c("l", rep("r", 4))) %>%
      column_spec(column = 1, width_min = "4cm") %>%
      column_spec(column = 2:4, width_min = "3cm") %>%
      column_spec(column = 5, width_min = "2.5cm")
    12-month (N=30) baseline (N=30) Difference (N=30) p value
    hba1c 0.730
    - N-Miss 0 0 0
    - Mean (SD) 8.76 (3.09) 8.45 (3.09) -0.31 (4.78)
    - Median (Q1, Q3) 8.66 (6.05, 11.78) 7.81 (5.80, 11.30) -0.29 (-3.74, 2.44)
    - Range 4.21 - 13.73 4.01 - 12.77 -9.12 - 7.58
    ldl 0.855
    - bad control 16 (53.3%) 14 (46.7%) 16 (100.0%)
    - good control 14 (46.7%) 16 (53.3%) 14 (100.0%)
    tests(table) %>%
    kable()
    Group Variable p.value Method
    time hba1c 0.7303420 Wilcoxon signed rank exact test
    time ldl 0.8551321 McNemar’s Chi-squared test with continuity correction

Analysis: Generalised Linear Models (GLM)

How are categorical variables represented?

  1. Categorical variables are characters, but we need a numeric matrix in order to perform computations to get the \(\beta\) coefficients
set.seed(10)
idx <- sample(1:344, 5)
penguins[idx,]
# A tibble: 5 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Dream               35.6          17.5               191        3175
2 Chinstrap Dream               50.7          19.7               203        4050
3 Adelie    Torgersen           39.7          18.4               190        3900
4 Gentoo    Biscoe              43.2          14.5               208        4450
5 Gentoo    Biscoe              47.2          13.7               214        4925
# ℹ 2 more variables: sex <fct>, year <int>
  1. So one-hot encoding / dummy encoding will be applied to categorical variables
model.matrix(~species, penguins)[idx,]
    (Intercept) speciesChinstrap speciesGentoo
137           1                0             0
330           1                1             0
72            1                0             0
211           1                0             1
271           1                0             1

Framework

flowchart LR
  A[Model specification] --> B[Inference] --> C[Diagnostics]

  1. Model specification
    1. How to decide what type of model to fit?
    2. How to decide what variables to choose?
  2. Inference
    1. How do we get the value of the coefficients, confidence intervals, p-values?
  3. Diagnostics
    1. Does the model fit our data well?

Why the need for GLM?

  1. We want to find the association between an outcome (e.g. BMI) and some dependent variables (e.g. age, gender, ethnicity, social economic status)
  2. We are used to fitting a multivariable linear regression \[ \begin{align} \underbrace{BMI}_{\text{continuous}} &= \underbrace{\beta_0 + \beta_1age + \beta_2gender + \beta_3ethnicity + \cdots + \epsilon}_{\text{this linear combination is continuous}} \\ \epsilon &\sim N(0, \sigma^2) \end{align} \]
  3. What if your outcome is not continuous?
    1. E.g. outcome takes on values of 0 or 1 (logistic regression)
    2. E.g. outocme takes on discrete integer values 0, 1, 2, … (poisson regression)
  4. GLM gives you a way to relate the outome (which can take on various distributions) with a linear combination of dependent variables

Formal definition of GLM

GLM consists of 2 components

Component 1: What distribution does your outcome, \(y_i\) take on?

\[\begin{align} y_i|X_i \sim \text{member of exponential family} \ \text{(e.g. $y_i$ is normally distributed)} \end{align}\]

  1. \(i\) indexes the \(i\)-th observation of your dataset i.e. corresponds to a particular row of your dataset
  2. \(y_i\) represents the \(i\)-th observation of your outcome variable
  3. \(X_i\) represents the \(i\)-th observation of all your independent variables (e.g. age, gender, ethnicity)

Component 2: What is the link function \(g\) you want to use?

\[\begin{align} g(\mathbb{E}(y_i|X_i)) = X_i^{T} \beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]

  1. \(g\) is the link function that maps \(\mathbb{E}(y_i|x_i)\) to \(X_i^{T}\beta\)
  2. It is the link function that enables the mapping of the continuous \(X_i^{T}\beta\) to \(y_i\), which can be discrete for example
  1. In a generalised linear model, the outcome variable \(y_1,...,y_n\) are modelled as independent and identically distributed (iid) observations

  2. The outcome variable is treated as a random variable (i.e. outcome takes on a certain distribution e.g. normal), but NOT the independent variables

Family member 1: Linear regression

When our outcome, \(y_i\) is continuous and takes on real values (i.e. \(y_i \in \mathbb{R}\)), we may choose to model \(y_i\) with a normal distribution

Component 1: \(y_i|x_i\) is normally distributed

\[ y_i | x_i \sim \mathcal{N}(X_i^{T}\beta, \sigma^2) \]

Component 2: link function is the identity function i.e. no transformation done

\[ \mathbb{E}(y_i|x_i) = X_i^{T}\beta = \beta_0 + \beta_1age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \]

Running a linear regression in R

  1. Notice the use of the formula syntax. LHS of the ~ is the dependent variable, while RHS are the independent variables
  2. Notice the family argument contains the 2 components of the GLM we discussed earlier
model <- 
  glm(body_mass_g ~ species + bill_length_mm +
                    bill_depth_mm + 
                    flipper_length_mm + sex,
      data = penguins,
      family = gaussian(link = "identity"))

summary(model)

Call:
glm(formula = body_mass_g ~ species + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex, family = gaussian(link = "identity"), 
    data = penguins)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1460.995    571.308  -2.557 0.011002 *  
speciesChinstrap   -251.477     81.079  -3.102 0.002093 ** 
speciesGentoo      1014.627    129.561   7.831 6.85e-14 ***
bill_length_mm       18.204      7.106   2.562 0.010864 *  
bill_depth_mm        67.218     19.742   3.405 0.000745 ***
flipper_length_mm    15.950      2.910   5.482 8.44e-08 ***
sexmale             389.892     47.848   8.148 7.97e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 82563.33)

    Null deviance: 215259666  on 332  degrees of freedom
Residual deviance:  26915647  on 326  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 4723.9

Number of Fisher Scoring iterations: 2

Family member 2: Logistic regression

When our outcome, \(y_i\), is discrete and dichotomous (i.e. take two discrete values, 0-1, success-failure), we can model \(y_i\) with a Bernoulli distribution

Component 1: \(y_i|x_i\) is Bernoulli distributed

\[ y_i | x_i \sim \mathrm{Bernoulli}(\pi_i) \]

  • \(\pi_i\) is the probability of success of \(y_i\)
  • \(\mathbb{E}(y_i|x_i) = \pi_i\)

Component 2: Link function, \(g\), is the logit function

\[\begin{align} g(\mathbb{E}(y_i|x_i)) &= \underbrace{\mathrm{ln} \left( \frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)} \right)}_{logit[\mathbb{E}(y_i|x_i)]} = X_i^{T}\beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]

Running a logistic regression in R

  1. To be filled only after completion of exercise
  2. Code is largely similar to linear regression except for:
    1. family = binomial(link = "logit")

Inferring the coefficients

  1. Maximum likelihood estimation (MLE) is the way to infer the beta coefficients

  2. The idea behind MLE is that we want to maximise the joint probability of our observed data assuming that the data was generated according to our model i.e. we want to maximise the likelihood function \(\mathcal{L}(\beta) = P(y_1, y_2, ..., y_n)\).

  3. In the case of logistic regression, the likelihood function \(\mathcal{L}(\beta)\) is: \[\begin{align} \mathcal{L}(\beta) &= P(y_1, y_2, ..., y_n) \\ &= \prod_{i=1}^{n} P(y_i) \\ &= \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ \end{align}\]

  4. Convert the likelihood to log-likehood to make it easier to work with (we can do that because log is a monotonic function) \[\begin{align} \mathrm{ln}\ \mathcal{L}(\beta) % &= \mathrm{ln} \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ % &= \sum_{i=1}^{n} \left[ y_i\mathrm{ln}\pi_i + (1-y_i)\mathrm{ln}(1-\pi_i) \right] \\ % &= \sum_{i=1}^{n} \left[ y_i \mathrm{ln} \left( \frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) + (1-y_i) \mathrm{ln} \left( 1-\frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) \right] \\ % &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) + (1-y_i) \mathrm{ln} \left( \frac{1}{1+e^{\beta^{T}x_i}} \right) \right] \\ % &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) - (1-y_i) \mathrm{ln} (1 + e^{\beta^{T}x_i} ) \right] \\ % &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \\ \end{align}\]

  5. In order to maximise the log-likelihood, we will use the iteratively reweighted least squares (IRLS) algorithm which is a Newton-Raphson like method \[\begin{align} \beta^{(t+1)} = \beta^{(t)} - \textit{H}^{-1}(\beta^{(t)}) \nabla_{\beta} \mathcal{L}(\beta^{(t)}) \end{align}\] The idea behind this is to update \(\beta\) via a series of iterations such that on every update, we are stepping in the direction of steepest ascent.

  6. To use the IRLS, first we need to calculate the gradient of the log-likelihood \[\begin{align} \nabla_{\beta} \mathrm{ln}\ \mathcal{L}(\beta) &= \sum_{i=1}^{n} \left[ y_i \nabla_\beta \beta^{T}x_i - \nabla_\beta \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \\ % &= \sum_{i=1}^{n} \left[ y_i x_i - \frac{e^{\beta^{T}x_i}}{1+e^{\beta^{T}x_i}} x_i \right] \\ % &= \sum_{i=1}^{n} (y_i - \pi_i) x_i \\ & = X^{T}(y - \pi) \end{align}\]

  7. Next, we will calculate the Hessian, \(H\), of the log-likelihood by taking derivatives one more time from the gradient. \[\begin{align} H &= -\sum_{i=1}^{n} \pi_i (1-\pi_i) x_i x_i^{T} \\ &= - X^{T} W X \\ \end{align}\]

    where \(W\) is a diagonal weight matrix given by: \[\begin{align} W = diag\{\pi_1 (1-\pi_1), \pi_2 (1-\pi_2), ..., \pi_n (1-\pi_n)\} \end{align}\]

  8. Now that we have the components, we can substitute these back into (5) to expand it out \[\begin{align} \beta^{(t+1)} = \beta^{(t)} + (X^{T}W^{(t)}X)^{-1}X^{T}(y - \pi^{(t)}) \end{align}\]

Interpret coefficients: logistic regression

  1. Step 1: Start from the equation linking outcome to indepenent variables \[\begin{align} \mathrm{ln}\left(\underbrace{\frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)}}_{odds_i}\right) = \beta_0 &+ \beta_1 age_i + \beta_2 gender^{(male)}_i \\ &+ \beta_3 ethnicity^{(malay)}_i + \beta_4 ethnicity^{(indian)}_i + \cdots \end{align}\]

  2. Step 2: Analyse unit changes

    1. Continuous example: 1 unit increase in age, with everything else held constant. The exponent of \(\beta_1\) gives us the odds ratio for every unit increase in age. \[\begin{align} \mathrm{ln}\ odds_i|_{age=a+1} - \mathrm{ln}\ odds_i|_{age=a} &= \beta_1 (a+1) - \beta_1 (a) \\ \mathrm{ln}\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= \beta_1 \\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= e^{\beta_1} \end{align}\]

    2. Categorical example: compare to reference category, with everything else held constant. The exponent of \(\beta_3\) gives us the odds ratio of malay compared to chinese. \[\begin{align} \mathrm{ln}\ odds_i|_{ethnicity=malay} - \mathrm{ln}\ odds_i|_{ethnicity=chinese} &= \beta_3 \\ \mathrm{ln}\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= \beta_3 \\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= e^{\beta_3} \end{align}\]

Family member 3: Gamma regression

When our outcome \(y_i\), is continuous and positive, we can model \(y_i\) with a Gamma distribution

Component 1: \(y_i\) is Gamma distributed

\[ y_i | X_i \sim Gamma(\alpha, \beta) \]

Component 2: One choice of the link function, \(g\), is the ln function

\[ \begin{align*} g(\mathbb{E}(y_i|X_i)) = \mathrm{ln}\ \mathbb{E}(y_i|X_i) = \beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \\ \end{align*} \]

Notes:

  1. The inverse of the ln function is the exponent function. Alternatively, we can also express \(\mathbb{E}(y_i|x_i) = e^{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}\)

  2. From this expression, we can see that the inverse of the link function (i.e. exponent) maps \(X_i^{T}\beta\) to \(\mathbb{E}(y_i|X_i)\). And we know that this is a valid function because \(\mathbb{E}(y_i|X_i)=e^{X_i^{T}\beta} > 0\)

Interpret coefficients: gamma regression

  1. Step 1: Start from the equation linking outcome to independent variables \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i) = \beta_0 &+ \beta_1 age_i + \beta_2 gender^{(male)}_i \\ &+\beta_3 ethnicity^{(malay)}_i + \beta_4 ethnicity^{(indian)}_i + \cdots \\ % \end{align}\]
  2. Step 2: Analyse unit changes
    1. Continuous example: 1 unit increase in age, with everything else held constant. The percentage change in mean outcome for every unit increase in age can be given from \(e^{\beta_1} - 1\) \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a+1} - \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a} &= \beta_1 (a+1) - \beta_1(a) \\ % \mathrm{ln}\ \frac{\mathbb{E}(y_i|X_i)|_{age=a+1}}{ \mathbb{E}(y_i|X_i)|_{age=a}} &= \beta_1 \\ % \frac{\mathbb{E}(y_i|X_i)|_{age=a+1}}{ \mathbb{E}(y_i|X_i)|_{age=a}} -1 &= e^{\beta_1} - 1 \end{align}\]
    2. Interaction terms example: compare vs reference category, age at some value \(a\), with everything else held constant
      1. The percentage change in mean outcome for male compared to female can be given from \(e^{\beta_2 + \beta_3 a} - 1\)
      2. Notice now that this percentage change which was \(\beta_2\) (if no interaction had been defined) is now modified by \(\beta_3 a\) which depends on age \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a, gender=male} - \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a, gender=female} &= (\beta_1 a + \beta_2 + \beta_3 a) - \beta_1 a \\ % \mathrm{ln}\ \frac{\mathbb{E}(y_i|X_i)|_{age=a, gender=male}}{ \mathbb{E}(y_i|X_i)|_{age=a, gender=female}} &= \beta_2 + \beta_3 a \\ % \frac{\mathbb{E}(y_i|X_i)|_{age=a, gender=male}}{ \mathbb{E}(y_i|X_i)|_{age=a, gender=female}} -1 &= e^{\beta_2 + \beta_3 a} - 1 \end{align}\]

Family member 4: Conway-Maxell-Poisson (CMP) regression

  1. When our outcome variable is a “count”, the poisson regression model is used because the poisson distribution models counts better than say linear regression which assumes that the outcome variable is normally distributed
  2. However, the poisson regression model is restrictive as it assumes equi-dispersion in our outcome variable (i.e. mean of outcome variable approximately equals the variance of the outcome variable) When there is under- or over-dispersion, the poisson distribution no longer models the outcome variable well.
  3. When there is over-dispersion (variance of outcome > mean of outcome), negative binomial regression is often used
  4. When there is under-dispersion (i.e. variance of outcome < mean of outcome), we will use the Conway-Maxwell-Poisson regression model
  5. Useful resources: (i) Video explanations see1,2 (ii) Sellers & Premeaux3 explains the CMP regression and surveys available statistical packages

CMP distribution

  1. The Conway-Maxwell-Poisson distribution differs from the Poisson distribution with the introduction of a parameter \(\nu\)
  2. The \(\nu\) parameter enables the Conway-Maxwell-Poisson distribution to model a wide range of dispersion, and generalises well-known distributions
    1. When \(\nu = 1\), it takes on the Poisson distribution (equi-dispersion)
    2. When \(\nu = 0\) and \(\lambda < 1\), it takes on the geometric distribution (over-dispersion)
    3. When \(\nu \rightarrow \infty\), it takes on the bernoulli distribution (under-dispersion)

CMP regression

  1. Use of the Conway-Maxwell-Poisson (CMP) distribution in the generalised linear model (GLM) setting:
    1. In the original CMP formulation of Sellers and Shmueli4, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by the equations:
      1. \(ln(\lambda_{i}) = X_{i}^T\beta\)
      2. \(\mathbb{E}(Y_{i}) = \lambda_{i} \frac{\partial ln Z(lambda_{i}, \nu)}{\partial \lambda_i} \approx \lambda_{i}^{\frac{1}{\nu}} - \frac{\nu - 1}{2\nu}\), where \(Z(\lambda_i, \nu_i)\) is a normalizing constant
  2. However, the CMP regression model is not amenable to interpretation of coefficients by mean contrasts, because \(\lambda_i\) is related to \(\mathbb{E}(Y_i)\) by a non-linear function
  3. To overcome this, Huang 20175 reparameterised the CMP distribution to model the mean directly. He called it CMP\(\mu\)

CMP\(\mu\)-regression

  1. Under the CMP\(\mu\)-regression formulation, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by \(ln(\mathbb{E}(Y_i|X)) = X_{i}^T\beta\)
  2. The convenient thing about CMP-regression and CMP\(\mu\)-regression is that we can conduct a likelihood ratio test4,5 to see if a poisson regression model (poisson regression is a special case of CMP-regression when \(\nu = 1\)) is adequate (i.e. \(H_0: \nu = 1\) vs \(H_1: \nu \ne 1\))
  3. Note that \(H_1\) does not specify the direction (under vs over) of data dispersion. This can be assessed via the dispersion estimate \(\hat{\nu}\)4 (i.e. \(\hat{\nu} > 1\) if under-dispersion and \(\hat{\nu} < 1\) if over-dispersion)
  4. Model diagnostics5: If the underlying distributional model is correct, the probability inverse transformation (PIT) should resemble a random sample from a standard uniform distribution. Goodness-of-fit can then be assessed by plotting a histogram of the PIT, or via a quantile plot of the PIT against the uniform distribution

Running a CMP\(\mu\)-regression in R

We will use the mpcmp6 R package’s implementation of the CMP\(\mu\)-regression

Analysis: Concordance

What is the Kappa statistic?

  1. The Kappa statistic is a measure of “true” agreement. It indicates the proportion of agreement beyond that expected by chance7
  2. If prevalence index is high, chance agreement will be higher, kappa is reduced
  3. If bias index is high, chance agreement will be lower, kappa is higher

Calculation of Kappa statistic

  1. Calculation of prevalence-adjusted, bias-adjusted Kappa can be done through the epiR package

Analysis: Tests of reliability

Internal Consistency

  1. Cronbach’s alpha is a measure of internal consistency of questionnaire
  2. Interpreting Cronbach’s alpha
    1. Cronbach’s alpha between 0.70 to 0.80 is considered good enough8
    2. Cronbach’s alpha is shown as raw_alpha in the output
  3. Caveats:
    1. All else equal, alpha will increase as the number of items on the scale increases8 (page 799, section 17.8.2)
    2. If responses lack variability (i.e. almost everyone gives the same response), alpha tends to be poor8 (page 803)
    3. “If alpha is too high it may suggest that some items are redundant as they are testing the same question but in a different guise. A maximum alpha value of 0.90 has been recommended”9

Test-retest reliability

  1. We will use the Intraclass Correlation Coefficient to assess for test-retest reliability
  2. Out of the 10 variants of ICC, we will compute the “two-way mixed effects, absolute agreement, single rater/measurement” ICC variant or ICC (A,1) of McGraw and Wong10 per recommendations from11,12:
  3. When it comes to interpreting the ICC, according to Koo et al.11:
    1. ICC less than 0.5: poor reliability
    2. ICC between 0.5 and 0.75: moderate reliability
    3. ICC between 0.75 and 0.90: good reliability
    4. ICC greater than 0.90: excellent reliability

Analysis: Comparative effectiveness research

Background

  1. Confounding bias

Purpose of propensity score matching

  1. One of the methods to deal with confounding bias in retrospective observational study
  2. Compare the effectiveness of different treatment options13
  3. Compare the effectiveness of a health intervention programme vs usual care group14

How do we choose matching variables?

Analysis: Intervention effect estimation

Analysis: Missing data

Outline

  1. Why do we care about missing data?
  2. What are the types of missingness mechanisms?
  3. What are the implications of each missingness mechanism?
  4. Multiple imputation as a method for imputing missing data

Motivation: why do we care about missing data?

  1. Commonly used statistical models (e.g. GLM) require complete-cases \(\rightarrow\) forced to discard data points
  2. Data is expensive to collect
  3. In throwing out data, you will face the problem of:
    1. Efficiency: number of data points are reduced \(\rightarrow\) lose statistical power
    2. Bias: you might systematically exclude a population group and end up with a biased dataset
    3. Both bias and efficiency will influence your final conclusion

Missingness mechanism

  1. A missingness mechanism refers to the underlying process which generated the missing data
  2. There are 3 missingness mechanisms
    1. Missing completely at random (MCAR)
    2. Missing at random (MAR)
    3. Missing not at random (MNAR)

Framework for understanding missingness mechanism

  1. We will use the framework as laid out in Mohan & Pearl15
  2. Reason: Clearest way to discriminate between MCAR, MAR, MNAR
  3. Prerequisite: This requires an understanding of the concept of d-separation in graphical models

Framework notation

Suppose we have data in the following format15

  • \(A\): Age
  • \(G\): Gender
  • \(O\): Obesity
  • \(O^*\): Obesity
  • \(R_O\): Missing indicator
  1. Fully observed variables are shaded (e.g. \(A\), \(G\))
  2. Partially observed variables (with missing data) are non-shaded (e.g. \(O\))
  3. To model the missingness process, two variables are introduced:
    1. \(R_O\) - A binary {0, 1} masking variable that governs missingness
    2. \(O^*\) - A proxy variable for obesity that is fully observed \[ O^* = f(R_O, O) = \begin{cases} O & \text{if } R_O = 0 \\ m & \text{if } R_O = 1 \end{cases} \]

#1 Missing completely at random (MCAR)

  1. We say that the missing obesity data is MCAR if the cause of missingness is independent of any other variable (There are no edges between \(R_O\) and \(G, A, O\))
  2. Examples of how Obesity could be MCAR:
    1. Technical fault: eg. random connectivity downtime in the weighing machine that prevents the sending of patient’s weight to EMR
    2. Workflow: patient forgets to take his/her weight after questionnaire (and we could not call the patient to get his/her weight after)

Implications of MCAR

  1. Complete-case analysis \(\rightarrow\) unbiased dataset, though loss of statistical power due to fewer \(N\)
  2. Multiple imputation \(\rightarrow\) unbiased dataset
    1. Multiple imputation attempts to impute missing variables using a model of the partially observed variable (\(Y\)) on the observed variables (\(X\))
    2. Why unbiased? \(R_O \perp\!\!\!\!\!\!\perp O\) implies that \(P(O | A, G, R_O = 1) = P(O | A, G, R_O = 0)\)
    3. The distribution of missing obesity values (which we are trying to impute) is the same as the distribution of observed obesity values, given the other observed variables like age & gender
    4. So we can estimate the missing obesity values using a logistic regression model

#2 Missing at random (MAR)

  1. We say that the missing Obesity data is MAR if the cause of missingness is dependent on fully observed variable(s) eg. Age
  2. Conditional independence assertion:
    1. Missingness, \(R_O\), is conditionally independent of Obesity given Age i.e. \(R_O \perp\!\!\!\!\!\!\perp O | A\)
    2. Or you can think of it as: missingness, \(R_O\), is only dependent on Age
  3. Example of how Obesity could be MAR:
    1. A lot of missing weight data among teenagers who were rebellious and did not report their weight

Implications of MAR

  1. Complete-case analysis \(\rightarrow\) biased dataset
    1. Why biased? Data from teenage patients is excluded
  2. Multiple imputation \(\rightarrow\) unbiased dataset
    1. Why unbiased? Data from teenage patients is imputed and not excluded
    2. The conditional independence assertion above implies that \(P(O | A, G, R_O = 1) = P(O | A, G, R_O = 0)\)
    3. We can impute the missing data as the distribution of missing obesity values (which we are trying to impute) is the same as the distribution of observed obesity values, given the other observed variables like age & gender

#3 Missing not at random (MNAR)

  1. We say that the missing Obesity data is MNAR if the cause of missingness is dependent on the Obesity variable itself
  2. Example of how Obesity could be MNAR:
    1. A lot of missing weight data from obese patients who are embarrassed and did not disclose their weight
    2. In another setting, we often find that low/high income earners do not disclose their income

Implications of MNAR

  1. Complete case analysis \(\rightarrow\) biased dataset
    1. Why biased? Data from obese patients is excluded
  2. Multiple imputation \(\rightarrow\) biased dataset
    1. Why biased? The distribution of missing obesity values (which we are trying to impute) is different from the distribution of observed obesity values, given the other observed variables like age & gender i.e. \(P(O | A, G, R_O = 1) \ne P(O | A, G, R_O = 0)\)
    2. Imputing missing weight data for obese patients using data from patients who are not obese \(\rightarrow\) imputed weight tend to be lower than what it should actually be
  3. Imputation of missing data requires causal modelling of \(R\) variables. No pure statistical method of imputation exists15

Can these missingness mechanisms be tested for?

  1. MAR and MNAR cannot be tested for (see section 1.4 of Enders16)
    1. In each case, an independence assertion regarding the missing variable was made
    2. Since the missing variable is by definition partially unobserved, we cannot do a test for such an assertion where the input to the test is unobserved
  2. MCAR can be tested for (see section 1.9 of Enders17)
    1. There are testable propositions, the idea is to assume that the cases with missing data came from the same population and thus have the same means and covariances as the complete cases
    2. Little’s test is one test for MCAR that evaluates the mean differences across subgroups of cases that share the same missing data pattern
    3. The null hypothesis for Little’s test states that the data are MCAR, so a statistically significant test statistic provides evidence against the MCAR mechanism

Caveats about Little’s test

From section 1.9 of Enders17

  1. The test does not identify the specific variables that violate MCAR, so it is only useful for testing an omnibus hypothesis that is unlikely to hold in the first place
  2. The version of the test outlined above assumes that the missing data patterns share a common covariance matrix. MAR and MNAR mechanisms can produce missing data patterns with different variances and covariances, and the test statistic in Equation 1.4 would not necessarily detect covariance based deviations from MCAR
  3. Simulation studies suggest that Little’s test suffers from low power, particularly when the number of variables that violate MCAR is small, the relationship between the data and missingness is weak, or the data are MNAR. Consequently, the test has a propensity to produce Type II errors and can lead to a false sense of security about the missing data mechanism

Multiple imputation by chained equations (MICE)

Conclusion

  1. We have learnt about the 3 missingness mechanisms of MCAR, MAR, MNAR
  2. We have considered the implications of complete-case analysis and multiple imputation on bias & efficiency
  3. Challenge remains: a thorough handling of missing data requires causal modelling in order to:
    1. Distinguish between the 3 missingness mechanisms
    2. Impute missing data in MNAR scenario where common methods like complete-case analysis or multiple imputation are invalid

Presentation of results

Markdown

  1. Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages.1
  2. Try out the markdown live preview

R Markdown

  1. R Markdown is an extension to markdown that enables user to execute R code and display the code output in addition to text
  2. The knitr application enables us to execute R code and display output in a temporary markdown document
  3. Pandoc then converts the markdown document to the desired format (e.g. html, pdf, docx, pptx)
  4. See summary by Robin Linacre

Image from R Markdown cookbook chapter 2.1

[DEMO] Creation of an R markdown document

  1. Create an rmarkdown document

  2. Knit your document into html

Components of Rmarkdown document

  1. YAML header: controls the meta data e.g. Title, date, output document format, table of contents
  2. Formatted text of your descriptions, explanations etc.
  3. Code chunks: where you run R code e.g. plot graphs and display the output in the same document

Image taken from An Introduction to R

R markdown YAML header

---
title: "My project"
author: "CRU"
date: '`r format(Sys.Date(), "%d %b %Y")`'
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
bibliography: references.bib
csl: vancouver-superscript.csl 
link-citations: true
---
  1. This is a common setting which I use
  2. bookdown::html_document2 of the Bookdown package is a good output format because it takes care of formatting e.g. auto-numbering of sections, handling of references, appendix
  3. toc (line 7 & 8) refers to table of contents
  4. bibliography, csl and link-citations is for adding citations stored in a bibtex file into your document. More will be covered in a later slide

Consort diagram

  1. The consort package provides a convenient consort_plot function to plot CONSORT diagrams for reporting inclusion/exclusion/allocation (see the vignette)

    library(consort)
    flow_diagram <- consort_plot(data = penguins %>%
                                          mutate(id = row_number(),
                                                 exclusion = case_when(island == "Dream" ~ "Penguins on Dream island",
                                                                       year == 2007 ~ "Data collected in 2007",
                                                                       TRUE ~ NA_character_) %>%
                                                             fct_relevel("Penguins on Dream island",
                                                                         "Data collected in 2007")),
                                 orders = c(id = "Study population",
                                            exclusion = "Excluded",
                                            id = "Data for analysis"),
                                            side_box = "exclusion",
                                 cex = 1.2)
    plot(flow_diagram)

  2. For presenting in html documents, the plot function does not render the plot fully so when that happens, save it as an image first

    # Change the file path to where you want your image to be saved
    ggsave(here("images/consort_diagram.png"), plot = build_grid(flow_diagram))
  3. Next, load your image using knitr::include_graphics

    # You can use the fig.height or fig.width chunk option to rescale your image
    knitr::include_graphics(here("images/consort_diagram.png"))

Presenting tables

pacman::p_load(arsenal, knitr)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
summary(text = TRUE) %>%
1knitr::kable(align = c("l", rep("r", 5)))
1
The kable function converts the table object to html
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.79 (2.66) 48.83 (3.34) 47.50 (3.08) 43.92 (5.46)
- Median (Q1, Q3) 38.80 (36.75, 40.75) 49.55 (46.35, 51.08) 47.30 (45.30, 49.55) 44.45 (39.23, 48.50)
- Range 32.10 - 46.00 40.90 - 58.00 40.90 - 59.60 32.10 - 59.60

Formatting HTML tables

The kableExtra package provides useful functions to format tables e.g. changing column width, changing colors, font size, fixed header rows etc.

pacman::p_load(arsenal, knitr, kableExtra)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "4.75cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.79 (2.66) 48.83 (3.34) 47.50 (3.08) 43.92 (5.46)
- Median (Q1, Q3) 38.80 (36.75, 40.75) 49.55 (46.35, 51.08) 47.30 (45.30, 49.55) 44.45 (39.23, 48.50)
- Range 32.10 - 46.00 40.90 - 58.00 40.90 - 59.60 32.10 - 59.60

Formatting HTML table subheaders

If you added labels to your dataframe using the labelled package, it can be presented as subheaders via the labelTranslations argument in summary

labels <- list(sex = "Sex",
               island = "Island",
               bill_length_mm = "Bill length (mm)")
labelled::var_label(penguins) <- labels

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
1summary(text = TRUE, labelTranslations = map(labelled::var_label(penguins), ~ str_glue("**{.x}**"))) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "5.5cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
1
The ** will render the text bold in Rmarkdown html documents
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
**Sex** 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
**Island** < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
**Bill length (mm)** < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.79 (2.66) 48.83 (3.34) 47.50 (3.08) 43.92 (5.46)
- Median (Q1, Q3) 38.80 (36.75, 40.75) 49.55 (46.35, 51.08) 47.30 (45.30, 49.55) 44.45 (39.23, 48.50)
- Range 32.10 - 46.00 40.90 - 58.00 40.90 - 59.60 32.10 - 59.60

Cross-references & citations

  1. See cross-references and citations on how to add internal cross-references and citations from a bibtex file
  2. See bibliographies and citations on how to add a references and appendix section

References

1.
VideoLecturesChannel. A Flexible Model for Count Data: The COM-Poisson Distribution. 2012.
2.
Consortium R. Observation driven Conway-Maxwell Poisson count data models. 2018.
3.
Sellers KF, Premeaux B. Conway–MaxwellPoisson regression models for dispersed count data. WIREs Computational Statistics. 2020;13(6):e1533.
4.
Sellers KF, Shmueli G. A flexible regression model for count data. The Annals of Applied Statistics. 2010;4(2):943–61, 19.
5.
6.
Fung T, Alwan A, Wishart J, Huang A. Mpcmp: Mean-Parametrizied Conway-Maxwell Poisson Regression. 2020.
7.
Sim J, Wright CC. The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy. 2005 Mar;85(3):257–68.
8.
Field A, Miles J, Field Z. Discovering Statistics Using R. SAGE Publications; 2012.
9.
Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ [Internet]. 2011/06/27 ed. 2011 Jun;2:53–5. Available from: https://www.ncbi.nlm.nih.gov/pubmed/28029643
10.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1(1):30–46.
11.
Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med [Internet]. 2016/06/23 ed. 2016 Jun;15(2):155–63. Available from: https://www.ncbi.nlm.nih.gov/pubmed/27330520
12.
Qin S, Nelson L, McLeod L, Eremenco S, Coons SJ. Assessing test-retest reliability of patient-reported outcome measures using intraclass correlation coefficients: Recommendations for selecting and documenting the analytical formula. Qual Life Res [Internet]. 2018/12/14 ed. 2019 Apr;28(4):1029–33. Available from: https://www.ncbi.nlm.nih.gov/pubmed/30547346
13.
Jaksa A, Gibbs L, Kent S, Rowark S, Duffield S, Sharma M, et al. Using primary care data to assess comparative effectiveness and safety of apixaban and rivaroxaban in patients with nonvalvular atrial fibrillation in the UK: An observational cohort study. BMJ Open [Internet]. 2022/10/18 ed. 2022 Oct;12(10):e064662. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36253039
14.
Lu AD, Gunzburger E, Glorioso TJ, Smith WB 2nd, Kenney RR, Whooley MA, et al. Impact of Longitudinal Virtual Primary Care on Diabetes Quality of Care. J Gen Intern Med [Internet]. 2021/01/24 ed. 2021 Sep;36(9):2585–92. Available from: https://www.ncbi.nlm.nih.gov/pubmed/33483815
15.
Mohan K, Pearl J. Graphical Models for Processing Missing Data. Journal of the American Statistical Association. 2021;116(534):1023–37.
16.
Enders CK. Applied Missing Data Analysis. Guilford Publications; 2022.
17.
Enders CK. Applied Missing Data Analysis. Guilford Publications; 2010.